In [1]:
# all imports here
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import os
import glob
import re
import math
import calendar
import datetime
import random
from calendar import monthrange
from sklearn import datasets, linear_model, ensemble
from sklearn.model_selection import train_test_split
%matplotlib inline
In [2]:
DATA_FOLDER = './Data' # use the data folder provided in "Tutorial 02 - Intro to Pandas"
The DATA_FOLDER/ebola folder contains summarized reports of Ebola cases from three countries (Guinea, Liberia and Sierra Leone) during the recent outbreak of the disease in West Africa. For each country, there are daily reports that contain various information about the outbreak in several cities in each country.
Use pandas to import these data files into a single DataFrame.
Using this DataFrame, calculate for each country the daily average per month of new cases and deaths.
Make sure you handle all the different expressions for new cases and deaths that are used in the reports.
In [3]:
country_folder_names = ['guinea_data', 'liberia_data', 'sl_data']
country_paths = [DATA_FOLDER + '/ebola/' + name for name in country_folder_names]
# all files about the ebola task (by country)
all_ebola_files = [glob.glob(os.path.join(path, "*.csv")) for path in country_paths]
In [4]:
guinea_files = all_ebola_files[0]
guinea_df = pd.concat([pd.read_csv(path) for path in guinea_files], axis=0).rename(
    columns={'Date': 'Date', 'Description': 'Info', 'Totals': 'Total'})
guinea_df['Country'] = 'Guinea'
#'Total deaths of confirmed' | 'Total cases of confirmed'
liberia_files = all_ebola_files[1]
liberia_df = pd.concat([pd.read_csv(path) for path in liberia_files], axis=0).rename(
    columns={'Date': 'Date', 'Variable': 'Info', 'National': 'Total'})
liberia_df['Country'] = 'Liberia'
# 'Total death/s in confirmed cases' | 'Total confirmed cases'
sl_files = all_ebola_files[2]
sl_df = pd.concat([pd.read_csv(path) for path in sl_files], axis=0).rename(
    columns={'date': 'Date', 'variable': 'Info', 'National': 'Total'})
sl_df['Country'] = 'Sierra Leone'
#'death_confirmed' | 'cum_confirmed' || totals
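The description strings listed in the comments above were found by inspecting the raw files. A minimal sketch of that check, assuming the three DataFrames built above (illustrative only):
In [ ]:
# illustrative check: list the distinct description strings per country,
# to spot every phrasing used for confirmed cases and deaths
for name, df in [('Guinea', guinea_df), ('Liberia', liberia_df), ('Sierra Leone', sl_df)]:
    print(name, sorted(df['Info'].dropna().unique()))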
We then merge the three DataFrames into one and replace missing values with 0.
In [5]:
ebola_df = pd.concat(
    [
        guinea_df,
        liberia_df,
        sl_df
    ],
    axis=0
)
# replace missing values by 0
ebola_df.fillna('0', inplace=True)
The values in the date column are not all in the same format, so we need to standardize them. Then, we set the index of the DataFrame to a combination of the country and the date.
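As an illustration (with two made-up date strings, not taken from the files), pd.to_datetime normalizes different spellings of the same date to a single Timestamp:
In [ ]:
# made-up examples: two spellings of the same date both parse to 2014-08-04
print(pd.to_datetime('2014-08-04'), pd.to_datetime('8/4/2014'))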
In [6]:
# unify dates
ebola_df['Date'] = pd.to_datetime(ebola_df['Date'])
# build index
ebola_df.index = [ebola_df['Country'], ebola_df['Date']]
ebola_df.index.rename(['Country', 'Date_index'], inplace=True)
In [7]:
# displaying some
ebola_df[:5]
Out[7]:
In [8]:
ebola_df = ebola_df[['Total', 'Info']]
In [9]:
# displaying some rows
ebola_df[:5]
Out[9]:
We will assume from now on that only the cumulative counts of confirmed cases and deaths are used.
We could have chosen the daily counts instead of the cumulative counts. However, they were not always given in all countries, which means that we could not have used a uniform method of computation for every country. Moreover, using cumulative counts is sufficient, easier, and more consistent.
Following our assumption, we keep only the rows matching the descriptions mentioned above.
Furthermore, some entries do not consist only of digits but also contain '%' or ',' characters, which need to be removed.
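To make the computation concrete, here is a toy example (made-up numbers) of how the daily average per month falls out of two cumulative readings at the month's boundaries:
In [ ]:
# toy example with made-up numbers: cumulative confirmed cases
# on the first and last day of a 30-day month
cum_start, cum_end = 120, 270
daily_average = (cum_end - cum_start) / 30
print(daily_average)  # 5.0 new cases per day on average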
In [10]:
deaths_info_to_keep = ['Total deaths of confirmed', 'Total death/s in confirmed cases', 'death_confirmed']
cases_info_to_keep = ['Total cases of confirmed', 'Total confirmed cases', 'cum_confirmed']
ebola_df['Total'] = pd.to_numeric(ebola_df['Total'].replace(',|%', '', regex=True))
ebola_df['Deads'] = np.where(ebola_df['Info'].isin(deaths_info_to_keep), ebola_df['Total'], 0)
ebola_df['Cases'] = np.where(ebola_df['Info'].isin(cases_info_to_keep), ebola_df['Total'], 0)
# displaying some rows of the dataframe
ebola_df.head(20)
Out[10]:
In [11]:
countries = ['Guinea', 'Liberia', 'Sierra Leone']
infos = ['Deads', 'Cases']
# we don't need the "Total" and "Info" columns anymore
ebola_infos_df = ebola_df[infos]
In [12]:
# plotting data by country
for country in countries:
ebola_infos_df.loc[country].groupby(['Date_index']).agg(sum).plot(title=country)
As we can see above, lots of data is missing where y = 0; we will need to ignore those points later. Moreover, some points seem to be incorrect: the cumulative counts occasionally decrease, which is impossible unless one can resuscitate.
To get better estimates and ignore unrealistic values, we will create and train one model to detect outliers and another one to extrapolate. We will use a RANSACRegressor to detect the outliers and an ExtraTreesRegressor to extrapolate the values.
We could have used a LinearRegression instead of the ExtraTreesRegressor; this would have led to a better approximation of the trend and could also compensate for human error introduced during data collection. However, we are more interested in the exact daily averages given by the data: since the outliers are taken care of by the RANSACRegressor, the remaining error is small enough to be ignored.
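For readers unfamiliar with RANSAC, here is a minimal self-contained sketch (on synthetic data, not the Ebola data) of how the inlier mask is obtained:
In [ ]:
# minimal RANSAC sketch on synthetic data (not the Ebola data):
# fit a line robustly, then recover which points were treated as inliers
rng = np.random.RandomState(0)
X_demo = np.arange(50, dtype=float).reshape(-1, 1)
y_demo = 3.0 * X_demo.ravel() + rng.normal(scale=2.0, size=50)
y_demo[::10] -= 100  # inject a few gross outliers
ransac = linear_model.RANSACRegressor(random_state=0)
ransac.fit(X_demo, y_demo)
print(ransac.inlier_mask_.sum(), 'points kept as inliers out of', len(y_demo))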
In [13]:
day_offset = datetime.datetime(2014, 1, 1)
# number of days in the given month
def days_in_month(year, month):
    return monthrange(year, month)[1]
# length in days of an interval given as two timedeltas (relative to day_offset)
def days_in_interval(start, end):
    return (end.total_seconds() - start.total_seconds()) / (3600 * 24)
# example : a delta of 35 days would return "February"
def days_delta_to_month(days):
    return calendar.month_name[math.floor(days/30) + 1]
Since data is sometimes missing, we then extract, for each country, the intervals where we have data about the number of cases, and where we have the number of deaths.
Each of those intervals is then split into per-month sub-intervals by the helper below.
In [14]:
def build_interval_by_month(start, end):
    assert(start.year == end.year) # works within the same year only
    assert(start.month < end.month) # can't go backwards or stay within the same month
    interval = []
    # corner case #1 : start.day is not necessarily the first of the month
    interval.append([
        datetime.datetime(start.year, start.month, start.day),
        datetime.datetime(start.year, start.month, days_in_month(start.year, start.month))
    ])
    # full months strictly between start and end
    for month_idx in range(start.month + 1, end.month):
        interval.append([
            datetime.datetime(start.year, month_idx, 1),
            datetime.datetime(start.year, month_idx, days_in_month(start.year, month_idx))
        ])
    # corner case #2 : end.day is not necessarily the last of the month
    interval.append([
        datetime.datetime(end.year, end.month, 1),
        datetime.datetime(end.year, end.month, end.day)
    ])
    return [[date - day_offset for date in dates] for dates in interval]
In [15]:
intervals_of_interest = {}
for country in countries:
intervals_of_interest[country] = {}
for info in infos:
agg_data = ebola_infos_df.loc[country].groupby(['Date_index']).agg(sum)
agg_data_greater_zero = agg_data[agg_data[info]>0]
start = agg_data_greater_zero.index[0]
end = agg_data_greater_zero.index[-1]
intervals_of_interest[country][info] = build_interval_by_month(start, end)
To train our models, we created some new features to complement the only one we have. The primary feature is the date of the record, expressed as seconds elapsed since a previously chosen day offset (the beginning of the year) and divided by 100,000; the other features are simple transformations of it (log, square root, square).
In [16]:
def get_features(dates):
    X = pd.DataFrame(
        data=[date.total_seconds() for date in dates],
        index=range(len(dates)),
        columns=['Date_index']
    )
    X["Date_index"] = X["Date_index"] / 100000.
    X["log"] = X["Date_index"].apply(np.log)
    X["Date1/2"] = X["Date_index"]**(1/2.)
    X["Date^2"] = X["Date_index"]**2
    return X
In [17]:
# note : it also plots the predictions
def get_model(country, info):
    agg_data = ebola_infos_df.loc[country].groupby(['Date_index']).agg(sum)
    agg_data_greater_zero = agg_data[agg_data[info] > 0]
    delta = pd.DataFrame(agg_data_greater_zero[info].index)["Date_index"] - day_offset
    X = get_features(delta.tolist())
    # RANSAC flags the outliers; the ExtraTrees model is then fit on the inliers only
    reg1 = linear_model.RANSACRegressor(random_state=1024)
    reg2 = ensemble.ExtraTreesRegressor(random_state=1411)
    reg1.fit(X, agg_data_greater_zero[info])
    inlier_mask = reg1.inlier_mask_
    outlier_mask = np.logical_not(inlier_mask)
    # test_size=0.0 : we keep every inlier for training, with no held-out test set
    x_train, x_test, y_train, y_test = train_test_split(X[inlier_mask], agg_data_greater_zero[info][inlier_mask], test_size=0.0)
    reg2.fit(x_train, y_train)
    pred_df = pd.DataFrame(reg2.predict(X)).rename(columns={0: "Prediction"})
    pred_df["Real Values (zeroes filtered)"] = agg_data_greater_zero[info].values
    pred_df["Date"] = agg_data_greater_zero.index
    pred_df.plot(x="Date", title=country + ' - ' + info)
    return reg2
In [18]:
def plot_info_per_month(intervals, plot_name):
    intervals_df = pd.DataFrame(
        data=[interval[0] for interval in intervals],
        index=[interval[1] for interval in intervals],
    )
    intervals_df.index.name = "Months"
    intervals_df.plot(kind="bar", title=plot_name, legend=False)
    return intervals_df
In [19]:
pred = []
final_df = pd.DataFrame()
for country in countries:
    for info in infos:
        model = get_model(country, info)
        intervals = []
        for interval in intervals_of_interest[country][info]:
            features = get_features(interval)
            pred.append(model.predict(features))
            if interval[0] == interval[1]:
                # single-day month interval: difference with the end of the previous interval
                intervals.append([pred[-1][1] - pred[-2][1], days_delta_to_month(interval[0].days)])
            else:
                intervals.append([
                    (pred[-1][1] - pred[-1][0]) / days_in_interval(interval[0], interval[1]),
                    days_delta_to_month(interval[0].days)
                ])
        temp_df = plot_info_per_month(intervals, country + " - " + info + " - daily average per month")
        temp_df['Country'] = country
        temp_df['Info'] = info
        final_df = pd.concat([final_df, temp_df], axis=0)
final_df.index = [final_df.index, final_df['Country'], final_df['Info']]
final_df = final_df[[0]].rename(columns={0: 'Count'})
final_df.groupby(['Country', 'Info', 'Months']).head()
Out[19]:
In the DATA_FOLDER/microbiome subdirectory, there are 9 spreadsheets of microbiome data that were acquired from high-throughput RNA sequencing procedures, along with a 10th file that describes the content of each.
Use pandas to import the first 9 spreadsheets into a single DataFrame.
Then, add the metadata information from the 10th spreadsheet as columns in the combined DataFrame.
Make sure that the final DataFrame has a unique index and all the NaN values have been replaced by the tag unknown.
-> We import the metadata into a dataframe
-> We import each data file (MIDx.xls) into its own dataframe (1 file = 1 dataframe)
note: we treat the first column as row indices and later use it to concat the different dataframes
-> For each data file, we add its respective label that we get from the metadata dataframe
note: the task says to add the metadata as columns; we interpreted this as using it for the column names of each data source, which makes the most sense
-> We concat the MID dataframes column-wise into one single dataframe
-> We replace NaN values by "unknown"
reading all filenames in the folder ending with .xls
In [20]:
task2_files = glob.glob(os.path.join(DATA_FOLDER+'/microbiome/', "*.xls"))
task2_files.sort()
task2_files
Out[20]:
Thanks to the sort, we can see that the metadata file is at the end of the list, so we extract it from there.
separating data file paths (0 to 8) and metadata file path (last = 9)
In [21]:
metadata_file = task2_files[-1]
mids_files = task2_files[:-1]
importing metadata file into a dataframe and showing it
In [22]:
metadata_df = pd.read_excel(metadata_file, index_col=None)
metadata_df
Out[22]:
note: we can check that the order of the rows matches the order of the files (MID1 -> MID9).
This makes it easy to associate each file with its corresponding column name using indices:
In [23]:
mids_df = []
for idx, file in enumerate(mids_files):
    mids_df.append(pd.read_excel(file, index_col=0, header=None))
    # name the data column after the (GROUP, SAMPLE) pair of the matching metadata row
    mids_df[idx].columns = [[index] for index in metadata_df.loc[range(len(metadata_df)), ["GROUP", "SAMPLE"]].T[idx]]
mids_df[3][:5]
Out[23]:
(above : showing dataframe samples of file MID4.xls)
concat of the 9 dataframes into one (column-wise, aligning rows by taxon) and NAs are replaced by "unknown"
In [24]:
mids_df_concat = pd.concat(mids_df, axis=1)
mids_df_concat.fillna(value="unknown", inplace=True)
mids_df_concat
Out[24]:
(above : showing final dataframe)
In [25]:
from IPython.core.display import HTML
HTML(filename=DATA_FOLDER+'/titanic.html')
Out[25]:
For each of the following questions state clearly your assumptions and discuss your findings. We first cast the attributes that can be treated as Categorical and work with a DataFrame with a unique index.
In [26]:
data = pd.read_excel(DATA_FOLDER+'/titanic.xls')
In [27]:
data.head(5)
Out[27]:
In [28]:
data.describe(include='all')
Out[28]:
In [29]:
data['pclass'] = data.pclass.astype('category')
data['sex'] = data.sex.astype('category')
data['embarked'] = data.embarked.astype('category')
In [30]:
sns.countplot(x="pclass", data=data, palette="Greens_d")
Out[30]:
In [31]:
sns.countplot(x="embarked", data=data, palette="Greens_d")
Out[31]:
In [32]:
sns.countplot(x="sex", data=data, palette="Greens_d")
Out[32]:
For the discrete intervals, we decide to divide the ages into 9 intervals: one for each decade from 0 to 90 years old. Thanks to pandas' cut function, we do not have to bother about NAs.
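A quick illustration (with made-up ages) of why the NAs take care of themselves: pd.cut leaves NaN entries unbinned, so they never show up in the count plot.
In [ ]:
# made-up ages: pd.cut propagates NaN, so missing ages fall outside every bucket
ages = pd.Series([5, 25, np.nan, 42])
print(pd.cut(ages, [0, 10, 20, 30, 40, 50], right=False))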
In [33]:
sns.countplot(x=pd.cut(data.age, [0,10,20,30,40,50,60,70,80,90], right=False), palette="Greens_d")
Out[33]:
Let's first plot a pie chart of every value we have for the cabin floors.
In [34]:
data['cabin'].astype(str).str[0].value_counts(sort=False).plot(kind='pie')
Out[34]:
As we can see above, around 3/4 of the data is missing for this attribute. We need to ignore the missing values to get a better view of the pie chart. Note that the cabin floor T might seem to be a typo; we will assume it refers to the tank top that one can find on a Titanic deck plan, and hence keep the value.
In [35]:
data['cabin'].astype(str).str[0].value_counts()[1:].plot(kind='pie')
Out[35]:
In [36]:
df = data.groupby("pclass").survived.agg(["sum", lambda x: len(x) - sum(x)])
df.columns = ["survived", "dead"]
df = df.T
df.columns = ["1st class", "2nd class", "3rd class"]
df.plot(kind="pie", subplots=True, figsize=(18, 6))
Out[36]:
Seaborn allows us to plot such a graph directly. The black bars show the confidence intervals.
In [37]:
sns.barplot(x="sex", y="survived", hue="pclass", data=data);
To split the data into two equally populated age categories, we use the median and group the data on each side of it. Obviously, some records share the median value, so we will not get two perfectly equally populated categories; but given the amount of data we have, we consider this small delta acceptable.
In [38]:
df = data[['pclass', 'sex', 'age', 'survived']][data['age'].isnull() != True]
median_age = df[['age']].median()
df['age_cat'] = np.where(df[['age']] > median_age, '>{}'.format(median_age[0]), '<{}'.format(median_age[0]))
df = df[['pclass', 'sex', 'age_cat', 'survived']]
df.groupby(['age_cat', 'pclass', 'sex']).agg(np.mean)
Out[38]: